Abstract

Web page ranking and collaborative filtering require the optimization of sophisticated 
performance measures. Current Support Vector approaches are unable to optimize them 
directly and focus on pairwise comparisons instead. We present a new approach which 
allows direct optimization of the relevant loss functions. 

This is achieved via structured estimation in Hilbert spaces. It is most closely related to
Max-Margin Markov networks and to the optimization of multivariate performance measures. Key
to our approach is that during training the ranking problem can be viewed as a linear assignment
problem, which can be solved by the Hungarian Marriage algorithm. At test time, a sort
operation is sufficient, as our algorithm assigns a relevance score to every (document, query)
pair. Experiments show that our algorithm is fast and that it works very well.

1 Introduction 

Ranking, and web-page ranking in particular, has long been a fertile research topic of machine 
learning. It is now commonly accepted that ranking can be treated as a supervised learning 
problem, leading to better performance than using one feature alone [Burges et al., 2005, Cao 
et al., 2006]. Learning to rank can be viewed as an attempt to learn an ordering of the
data (e.g. web pages). Although ideally one might like to have a ranker that learns the partial 
ordering of all the matching web pages, users are most concerned with the topmost (part of 
the) results returned by the system. See for instance [Cao et al., 2006] for a discussion. 

This is manifest in corresponding performance measures developed in information retrieval,
such as Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR),
Precision@n, or Expected Rank Utility (ERU). They are used to evaluate rankers, search
engines, or recommender systems [Voorhees, 2001, Jarvelin and Kekalainen, 2002,
Breese et al., 1998, Basilico and Hofmann, 2004].

Ranking methods have come a long way in the past years. Beginning with vector space
models [Salton, 1971, Salton and McGill, 1983], various feature based methods have been
proposed [Lee et al., 1997]. Popular sets of features include BM25 [Robertson et al., 1994] and
its variants [Robertson and Hull, 2000]. Following the intent of Richardson et al. [2006] we show
that when combining such methods with machine learning, the performance of the ranker can
be increased significantly.

Over the past decade, many machine learning methods have been proposed. Ordinal regression
[Herbrich et al., 2000, Chu and Keerthi, 2005] using an SVM-like large margin method and
Neural Networks [Burges et al., 2005] were some of the first approaches. These were followed by
Perceptrons [Crammer and Singer, 2002] and online methods, such as [Crammer and Singer,
2005, Basilico and Hofmann, 2004]. The state of the art is essentially to describe the partial
order by a directed acyclic graph and to associate the cost incurred by ranking incorrectly with
the edges of this graph. These methods aim at finding the best ordering function over the
returned documents. However, it is difficult to express complex (yet commonly used) measures
in this framework.

Only recently have two theoretical papers [Rudin, 2006, Cossock and Zhang, 2006] discussed
the issue of learning to rank with preference for top scoring documents. However, the cost
function of [Rudin, 2006] is only vaguely related to the cost function used in evaluating the
performance of the ranker. [Cossock and Zhang, 2006] argue that, in the limit of large samples,
regression on the labels may be sufficient.

Our work uses the framework of Support Vector Machines for Structured Outputs [Tsochan- 
taridis et al., 2005, Joachims, 2005, Taskar et al., 2004] to deal with the inherent non-convexity 
of the performance measures in ranking. Due to the capacity control inherent in kernel methods 
it generalizes well to test observations [Scholkopf and Smola, 2002]. The optimization problem
we propose is very general: it covers a broad range of existing criteria in a plug-and-play manner. 
It extends to position-dependent ranking and diversity-based scores. 

Of particular relevance are two recent papers [Joachims, 2005, Burges et al., 2007] that address
the complexity of information retrieval loss functions. More specifically, [Joachims, 2005]
shows that two ranking-related scores, Precision@n and the Area under the ROC curve, can be
optimized by using a variant of Support Vector Machines for Structured Outputs (SVMStruct).
We use a similar strategy in our algorithm to obtain a Direct Optimization of Ranking Measures
(DORM) using the inequalities proposed in [Tsochantaridis et al., 2005]. [Burges et al., 2007]
consider a similar problem without the convex relaxation; instead they optimize the
nonconvex cost functions directly by dealing only with their gradients.

Outline: After a summary of structured estimation we discuss performance measures in in- 
formation retrieval (Section 3) and we express them as inner products. In Section 4 we compute 
a convex relaxation of the performance criterion and show how it can be solved efficiently using 
linear assignment. Experiments on web search and collaborative filtering show that DORM is 
fast and works well. 

2 Structured Estimation 

In the following we will develop a method to rank objects (e.g. documents d) subject to some 
query q by means of some function g(d,q). Obviously we want to ensure that highly relevant 
documents will have a high score, i.e. a large value of g. At the same time, we want to ensure 
that the ranking obtained is optimal with respect to the relevant ranking score. For instance,
for NDCG@10, i.e. a score where only the first 10 retrieved documents matter, it is not very
important what particular values the score g assigns to highly irrelevant pages, provided
that they remain below the acceptance threshold.

Obviously, we could use engineering skills to construct a reasonable function (PageRank is
an example of such a function). However, we can also use statistics and machine learning to
guide us in finding a function that is optimized for this purpose. This leads to a more
principled way of finding such a function, removing the need for educated guesses. The
particular tool we use is max margin structured estimation, as described in Tsochantaridis
et al. [2005]. See the original reference for a more detailed discussion.

2.1 Problem Setting 

Large margin structured estimation, as proposed by [Taskar et al., 2004, Tsochantaridis et al.,
2005], is a general strategy for solving estimation problems of mapping $X \to Z$ by finding
related optimization problems. More concretely, it solves the estimation problem of finding a
matching $z \in Z$ from a set of (structured) estimates, given patterns $x \in X$, by finding a
function $f(x, z)$ such that

    $z^*(x) := \operatorname{argmax}_{z \in Z} f(x, z)$.    (1)

This means that instead of finding a mapping $X \to Z$ directly, the problem is reduced to finding
a real valued function on $X \times Z$.

In the ranking case, $x \in X$ corresponds to a set of documents with a corresponding query,
whereas $z$ would correspond to the permutation which orders the documents such that the most
relevant ones are ranked first. Consequently $f$ will be a function of the documents, query, and
a permutation. It is then our goal to find an $f$ that is maximized for the "correct"
permutation.

To assess how well the estimate $z^*(x)$ performs we need to introduce a loss function
$\Delta(y, z)$, depending on $z$ and some reference labels $y$, which determines the loss at $z$.
For instance, if we want to solve the binary classification problem $y, z \in \{\pm 1\}$, where
$y$ is the observed label and $z$ the estimate, we could choose $\Delta(y, z) = 1 - \delta_{y,z}$.
That is, we incur a penalty of 1 if we make a mistake, and no penalty if we estimate $y = z$. In
the regression case, this could be $\Delta(y, z) = (y - z)^2$. Finally, in the sequence annotation
case, where both $y$ and $z$ are binary sequences, [Taskar et al., 2004, Tsochantaridis et al.,
2005] use the Hamming loss. In the ranking case, which we will discuss in Section 3, the loss
$\Delta$ will correspond to the relative regret incurred by ranking documents in a suboptimal
fashion with respect to WTA, MRR, DCG, NDCG, ERU or a similar criterion. Moreover $y$ will
correspond to the relevance scores assigned to various documents by reference users.
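The three example losses above can be written down directly. The following is a minimal sketch; the function names are illustrative and do not come from the original text.

```python
def zero_one_loss(y, z):
    """Binary classification loss 1 - delta(y, z): 1 on a mistake, 0 otherwise."""
    return 0 if y == z else 1


def squared_loss(y, z):
    """Regression loss (y - z)^2."""
    return (y - z) ** 2


def hamming_loss(y, z):
    """Number of positions at which two equal-length sequences disagree."""
    if len(y) != len(z):
        raise ValueError("sequences must have the same length")
    return sum(1 for y_t, z_t in zip(y, z) if y_t != z_t)
```

All three take the reference label first and the estimate second, matching the argument order of $\Delta(y, z)$.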

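In summary, it is our goal to find some function $f$ which minimizes the error incurred by $f$
on a set of observations $X = \{x_1, \ldots, x_m\}$ and reference labels $Y = \{y_1, \ldots, y_m\}$:

    $R_{\mathrm{emp}}[f, X, Y] := \frac{1}{m} \sum_{i=1}^{m} \Delta(y_i, \operatorname{argmax}_{z \in Z} f(x_i, z))$.    (2)

We will refer to $R_{\mathrm{emp}}[f, X, Y]$ as the empirical risk. Direct minimization of the
latter with respect to $f$ is difficult: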

• It is a highly nonconvex optimization problem. This makes practical optimization ex- 
tremely difficult, as the problem has many local minima. 

• Good performance with respect to the empirical risk $R_{\mathrm{emp}}[f, X, Y]$ does not
necessarily result in good performance on an unseen test set. In practice, strict minimization
of the empirical risk virtually ensures bad performance on a test set due to overfitting. This
issue has been discussed extensively in the machine learning literature (see e.g. [Vapnik, 1982]).

To deal with the second problem we will add a regularization term (here a quadratic penalty)
to the empirical risk. To address the first problem, we will compute a convex upper bound on
the loss $\Delta(y_i, \operatorname{argmax}_{z \in Z} f(x_i, z))$.

2.2 Convex Upper Bound 

The key inequality we exploit in obtaining a convex upper bound on the problem of minimizing
the loss $\Delta(y, z)$ is the following lemma, which is essentially due to [Tsochantaridis et al.,
2005].

Lemma 1 Let $f : X \times Z \to \mathbb{R}$ and $\Delta : Y \times Z \to \mathbb{R}$ and let
$z_0 \in Z$. Moreover let $\xi \in \mathbb{R}$. In this case, if $\xi$ satisfies

    $f(x, z_0) - f(x, z) \geq \Delta(y, z) - \Delta(y, z_0) - \xi$ for all $z \in Z$,

then $\xi \geq \Delta(y, \operatorname{argmax}_{z \in Z} f(x, z)) - \Delta(y, z_0)$. Moreover, the
constraints on $\xi$ and $f$ are linear.



Proof Linearity is obvious, as $\xi$ and $f$ only appear as individual terms. To see the first
claim, denote by $z^*(x) := \operatorname{argmax}_{z \in Z} f(x, z)$. Since the inequality needs to
hold for all $z$, it holds in particular for $z^*(x)$. This implies

    $0 \geq f(x, z_0) - f(x, z^*(x)) \geq \Delta(y, z^*(x)) - \Delta(y, z_0) - \xi$.

The first inequality holds by construction of $z^*(x)$. Rearrangement proves the claim. ■

Typically one chooses $z_0$ to be the minimizer of $\Delta$ and one assumes that the loss for
$z_0$ vanishes. In this case $\xi \geq \Delta(y, z^*)$. Note that this convex upper bound is tight
for $z_0 = z^*$ and if the minimal $\xi$ satisfying this inequality is chosen.
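Lemma 1 can be checked numerically on a small finite $Z$. The sketch below assumes that $f(x, \cdot)$ and $\Delta(y, \cdot)$ are given as dictionaries keyed by the candidates $z$; the helper names and the numbers are illustrative only.

```python
def minimal_slack(f, delta, z0):
    """Smallest xi with f[z0] - f[z] >= delta[z] - delta[z0] - xi for all z."""
    return max(delta[z] - delta[z0] - f[z0] + f[z] for z in f)


def loss_at_prediction(f, delta):
    """Loss incurred by the maximizer z* of f."""
    z_star = max(f, key=f.get)
    return delta[z_star]


# Toy example: three candidate outputs, z0 = 0 has zero loss.
f = {0: 1.0, 1: 2.5, 2: 0.3}
delta = {0: 0.0, 1: 1.0, 2: 2.0}
xi = minimal_slack(f, delta, z0=0)

# The slack upper-bounds the loss of the prediction, as the lemma states.
assert xi >= loss_at_prediction(f, delta) - delta[0]
```

Here the prediction is $z^* = 1$ with loss 1, while the minimal slack is larger because it also accounts for the margin by which $f$ prefers the wrong output.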



2.3 Kernels 

The last ingredient to obtain a convex optimization problem is a suitable function class for $f$.
In principle, any class, such as Decision Trees, Neural Networks, convex combinations of weak
learners as they occur in Boosting, etc. would be acceptable. For convenience we choose $f$ via

    $f(x, z) = \langle \Phi(x, z), w \rangle$.    (3)

Here $\Phi(x, z)$ is a feature map and $w$ is a corresponding weight vector. The advantage of
this formulation is that by choosing different maps $\Phi$ it becomes possible to incorporate prior
knowledge efficiently. Moreover, it is possible to express the arising optimization and estimation
problem in terms of the kernel functions

    $k((x, z), (x', z')) := \langle \Phi(x, z), \Phi(x', z') \rangle$    (4)

without the need to evaluate $\Phi(x, z)$ explicitly. This allows us to work with
infinite-dimensional feature spaces while keeping the number of parameters of the optimization
problem finite. Moreover, we can control the complexity of $f$ by keeping $\|w\|$ sufficiently
small (for binary classification this amounts to large margin classification).

Denote by $z_i$ the value of $z$ for which $\Delta(y_i, z)$ is minimized. Moreover, let $C > 0$ be
a regularization constant specifying the trade-off between empirical risk minimization and the
quest for a "simple" function $f$. Combining (2), Lemma 1, and (3) one arrives at the following
optimization problem:

    minimize    $\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$    (5a)
    subject to  $\langle w, \Phi(x_i, z_i) - \Phi(x_i, z) \rangle \geq \Delta(y_i, z) - \xi_i$    (5b)
                for all $\xi_i \geq 0$ and $z \in Z$ and $i \in \{1, \ldots, m\}$.

In the ranking case we will assume (without loss of generality) that the documents are already
ordered in decreasing order of relevance. In this case $z_i$ will correspond to the identity
permutation, which leaves the order of documents unchanged.
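For reference, the step from Lemma 1 to the constraint (5b) can be spelled out as follows. This is a sketch under the assumption stated after the lemma, namely that $z_0 = z_i$ is the loss minimizer and that its loss vanishes.

```latex
% Lemma 1 applied to example i with z_0 = z_i and \Delta(y_i, z_i) = 0,
% followed by substitution of f(x, z) = <\Phi(x, z), w> from (3).
\begin{align*}
f(x_i, z_i) - f(x_i, z)
  &\geq \Delta(y_i, z) - \Delta(y_i, z_i) - \xi_i \\
\langle w, \Phi(x_i, z_i) \rangle - \langle w, \Phi(x_i, z) \rangle
  &\geq \Delta(y_i, z) - \xi_i \\
\langle w, \Phi(x_i, z_i) - \Phi(x_i, z) \rangle
  &\geq \Delta(y_i, z) - \xi_i
\end{align*}
```

Summing the resulting slacks $\xi_i$ bounds the empirical risk (2) up to the factor $1/m$, which is why $\sum_i \xi_i$ appears in the objective (5a).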

2.4 Optimization 

One may show [Taskar et al., 2004] that the solution of (5) is given by

    $f(x', z') = \sum_{i,z} \alpha_{iz} \, k((x_i, z), (x', z'))$.    (6)
Algorithm 1 Column Generation

    Input: data $x_i$, labels $y_i$, sample size $m$, tolerance $\epsilon$
    Initialize $S_i = \emptyset$ for all $i$, and $w = 0$.
    repeat
        for $i = 1$ to $m$ do
            $z^* = \operatorname{argmax}_{z \in Z} \langle w, \Phi(x_i, z) \rangle + \Delta(y_i, z)$
            $\xi = \max(0, \max_{z \in S_i} \langle w, \Phi(x_i, z) \rangle + \Delta(y_i, z))$
            if $\langle w, \Phi(x_i, z^*) \rangle + \Delta(y_i, z^*) > \xi + \epsilon$ then
                Increase constraint set $S_i \leftarrow S_i \cup \{z^*\}$
                Optimize (7) using only $\alpha_{iz}$ where $z \in S_i$.
            end if
        end for
    until $S$ has not changed in this iteration



This fact is also commonly referred to as the Representer Theorem [Scholkopf and Smola, 2002].
The coefficients $\alpha_{iz}$ are obtained by solving the dual optimization problem of (5):

    minimize    $\frac{1}{2} \sum_{i,j,z,z'} \alpha_{iz} \alpha_{jz'} k((x_i, z), (x_j, z')) - \sum_{i,z} \Delta(y_i, z) \alpha_{iz}$    (7a)
    subject to  $\sum_{z} \alpha_{iz} \leq C$ and $\alpha_{iz} \geq 0$ for all $i$ and $z$.    (7b)

Solving the optimization problem (7) presents a formidable challenge. In particular, for large
$Z$ (e.g. the space of all permutations over a set of documents) the number of variables is
prohibitively large and it is essentially impossible to find an optimal solution within a practical
amount of time. Instead, one may use column generation [Tsochantaridis et al., 2005] to find
an approximate solution in polynomial time. The key idea is to check the constraints
(5b) to find out which of them are violated for the current set of parameters and to use this
information to improve the value of the optimization problem. That is, one needs to find

    $\operatorname{argmax}_{z \in Z} \Delta(y_i, z) + \langle w, \Phi(x_i, z) \rangle$,    (8)

as this is the term for which the constraint (5b) becomes tightest. If $Z$ is a small finite set of
values, this is best achieved by brute force evaluation. For binary sequences, one often uses
dynamic programming. In the case of ranking, where $Z$ is the space of permutations, we shall
see that (8) can be cast as a linear assignment problem.
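For a small finite $Z$, the brute force evaluation of (8) might look like the following sketch. The feature map `phi`, the loss `delta`, and the toy numbers are assumptions chosen for illustration (a binary classification problem with $\Phi(x, z) = z \cdot x$ and the 0-1 loss).

```python
def most_violated(w, phi, delta, x, y, candidates):
    """Return the z in `candidates` maximizing delta(y, z) + <w, phi(x, z)>."""
    def objective(z):
        inner = sum(w_k * p_k for w_k, p_k in zip(w, phi(x, z)))
        return delta(y, z) + inner
    return max(candidates, key=objective)


# Hypothetical toy setup: labels in {-1, +1}, joint feature map z * x.
def phi(x, z):
    return [z * x_k for x_k in x]


def delta(y, z):
    return 0 if y == z else 1


w = [1.0, -0.5]
x = [0.2, 0.4]
z_hat = most_violated(w, phi, delta, x, y=1, candidates=[-1, 1])
```

In this example both labels have the same score under `w`, so the loss term decides and the wrong label `-1` is returned as the most violated constraint.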

Algorithm 1 has good convergence properties. It follows from [Tsochantaridis et al.,
2005] that it terminates after at most $\max[2n\bar{\Delta}/\epsilon, 8C\bar{\Delta}R^2/\epsilon^2]$
steps, where $\bar{\Delta}$ is an upper bound on the loss $\Delta(y_i, z)$ and $R$ is an upper
bound on $\|\Phi(x_i, z)\|$.

To adapt the above framework to the ranking setting, we need to address three issues: a)
we need to derive a loss function $\Delta$ for ranking, b) we need to develop a suitable feature
map $\Phi$ which takes document collections, queries, and permutations into account, and c) we
need to find an algorithm to solve (8) efficiently.

3 Ranking and Loss Functions 
3.1 A Formal Description 

For efficiency, commercial rankers, search engines, or recommender systems usually employ
a document-at-a-time approach to answer a query $q$: a list of candidate documents is
evaluated (while retaining a heap of the $n$ top scoring documents) by evaluating the relevance
for a (document, query) pair one at a time. For this purpose a score function $g(d, q)$ is needed,
which assigns a score to every document given the query.¹ Performance of the ranker is
typically measured by means of a set of labels $y := \{r_1, \ldots, r_l\}$ with
$r_i \in [0, \ldots, r_{\max}]$, where 0 corresponds to 'irrelevant' and $r_{\max}$ corresponds to
'highly relevant'. Training instances contain document query pairs that are labelled by experts.
Such data for commercial search engines or recommender systems often have less than ten levels
of relevance.

    Variable                                  Meaning
    $x = (q, D)$                              document-query pair
    $q_i$                                     i-th query
    $l_i$                                     number of documents for $q_i$
    $D_i = \{d_{i1}, \ldots, d_{il_i}\}$      documents for $q_i$
    $r_{ij} \in [0, \ldots, r_{\max}]$        relevance of document $d_{ij}$
    $y_i = \{r_{i1}, \ldots, r_{il_i}\}$      reference label
    $f(x, \pi)$                               global scoring function
    $g(q_i, d_{ij})$                          individual scoring function
    $m$                                       number of queries for training

At training time we are given $m$ instances of queries $q_i$, document collections $D_i$ of
cardinality $l_i$, and labels $y_i$ with $|y_i| = |D_i|$. In the context of the previous section a
set of documents $D_i$ in combination with a matching query $q_i$ will play the role of a pattern,
i.e. $x_i := (q_i, d_{i1}, \ldots, d_{il_i})$. Likewise the reference labels $r_{ij} \in y_i$ consist
of the corresponding expert ratings for the documents $d_{ij}$.

We want to find some mapping $f$, such that the ordering of a new collection of documents
$d_1, \ldots, d_l$ obtained by sorting $g(d_i, q)$ agrees well with $y$ in expectation. We would
like to obtain a single ranking which will perform well on the query for a given performance
measure, unlike [Matveeva et al., 2006] who use a cascade of rankers.

Note that there is also a processing step associated with ranking documents: for each doc- 
ument query pair (d, q), we have to construct a feature vector x. In this paper, we assume that 
the feature vector is given, and also use (d, q) to mean x. For instance BM25 [Robertson et al., 
1994, Robertson and Hull, 2000], date, click-through logs [Joachims, 2002] have proved to be 
an effective set of features. 

Many widely-used performance measures in the information retrieval community are
irreducibly multivariate and permutation based. By permutation-based, we mean that the
performance measure can be computed by comparing two orderings. For example, 'Winner
Takes All' (WTA), Mean Reciprocal Rank (MRR) [Voorhees, 2001], Mean Average Precision
(MAP), and Normalized Discounted Cumulative Gain (NDCG) [Jarvelin and Kekalainen, 2002]
all fulfill this property.

It is our goal to find a suitable permutation $\pi(D, q, g)$ obtained for the collection $D$ of
documents $d_i$ given the query $q$ and the scoring function $g$. We will drop the arguments
$D, q, g$ wherever it is clear from the context. Moreover, given a vector $v \in \mathbb{R}^m$ we
denote by $v(\pi)$ the permutation of $v$ according to $\pi$, i.e. $v(\pi)_i = v_{\pi(i)}$. Finally,
without loss of generality (and for notational convenience) we assume that $y$ is sorted in
descending order, i.e. most relevant documents first. That is, the identity permutation
$\pi = 1$ will correspond to the sorting which returns the most relevant documents first with
respect to the reference labeling.

Note that $\pi$ will play the role of $z$ of Section 2. Likewise we will denote by $\Pi$ the space
of all permutations (i.e. $Z = \Pi$).

¹ For computational efficiency (not for theoretical reasons) it is not desirable that $f$ depends
on $\{d_1, \ldots, d_l\}$ jointly.
3.2 Scores and Loss



Winner Takes All (WTA): If the first document is relevant, i.e. if $y(\pi)_1 = r_1$, the score is
1, otherwise 0. We may rewrite this as

    $\mathrm{WTA}(\pi, y) = \langle a(\pi), b(y) \rangle$    (9)
    where $a_i = \delta(i, 1)$ and $b_i = \delta(r_i, r_1)$.

Note that there may be more than one document which is considered relevant. In this
case $\mathrm{WTA}(\pi, y)$ will be maximal for several classes of permutations.

Mean Reciprocal Rank (MRR): We assume that there exists only one top ranking document.
We have

    $\mathrm{MRR}(\pi, y) = \langle a(\pi), b(y) \rangle$    (10)
    where $a_i = 1/i$ and $b_i = \delta(i, 1)$.

In other words, the reciprocal rank is the inverse of the rank assigned to document $d_1$,
the most relevant document. MRR derives its name from the fact that this quantity is
typically averaged over several queries.

Discounted Cumulative Gain (DCG and DCG@n): WTA and MRR use only a single
entry of $\pi$, namely $\pi(1)$, to assess the quality of the ranking. Discounted Cumulative
Gains are a more balanced score:

    $\mathrm{DCG}(\pi, y) = \langle a(\pi), b(y) \rangle$    (11)
    where $a_i = 1/\log(i + 1)$ and $b_i = 2^{r_i} - 1$.

Here it pays if a relevant document is retrieved with a high rank, as the coefficients $a_i$
are monotonically decreasing. Variants of DCG which do not take all ranks into account
are DCG@n. Here $a_i = 1/\log(i + 1)$ if $i \leq n$ and $a_i = 0$ otherwise. That is, we only care
about the $n$ top ranking entries. In search engines the truncation level $n$ is typically 10,
as this constitutes the number of hits returned on the first page of a search.

Normalized Discounted Cumulative Gain (NDCG): A downside of DCG is that its
numerical range depends on $y$ (e.g. a collection containing many relevant documents will
yield a considerably larger value at optimality than one containing only irrelevant ones).
Since $y$ is already sorted it follows that DCG is maximized for the identity permutation
$\pi = 1$:

    $\mathrm{NDCG}(\pi, y) := \frac{\mathrm{DCG}(\pi, y)}{\mathrm{DCG}(1, y)}$ and $\mathrm{NDCG@}n(\pi, y) := \frac{\mathrm{DCG@}n(\pi, y)}{\mathrm{DCG@}n(1, y)}$.    (12)

This allows us to define

    $\mathrm{NDCG}(\pi, y) = \langle a(\pi), b(y) \rangle$    (13)
    where $a_i = \frac{1}{\log(i + 1)}$ and $b_i = \frac{2^{r_i} - 1}{\mathrm{DCG}(1, y)}$.

Finally, NDCG@n, the measure which this paper focuses on, is given by
$\langle a(\pi), b(y) \rangle$ where

    $a_i = \frac{1}{\log(i + 1)}$ if $i \leq n$ and $a_i = 0$ otherwise, and $b_i = \frac{2^{r_i} - 1}{\mathrm{DCG@}n(1, y)}$.    (14)
Precision@n: Note that this measure, too, can be expressed by $\langle a(\pi), b(y) \rangle$. Here
we define

    $a_i = 1/n$ if $i \leq n$ and $a_i = 0$ otherwise, and $b_i = 1$ if $r_i$ is correct and $b_i = 0$ otherwise.    (15)

The main difference to NDCG is that Precision@n has no decay factor, weighing the top
$n$ answers equally.

Expected rank utility: It has an exponential decay in the top ranked items and it can be
represented as

    $\mathrm{ERU}(\pi, y) = \langle a(\pi), b(y) \rangle$    (16)
    where $a_i = 2^{\frac{1 - i}{\alpha - 1}}$ and $b_i = \max(r_i - d, 0)$.

Here $d$ is a neutral vote and $\alpha$ is the viewing halflife. The normalized ERU can also be
defined in a similar manner to NDCG. The (normalized) ERU is often used in collaborative
filtering for recommender systems where the lists of items are often very short.

It is commonly accepted that NDCG@n is a good model of a person's judgment of a search
engine: the results on the first page matter, and among them there should be a decay factor.
NDCG has the further advantage that it is more structured and more general than WTA and MRR.
For collaborative and content filtering, ERU is more popular [Breese et al., 1998, Basilico and
Hofmann, 2004].

Since we set out to design a loss function as described in Section 2.1, we now define the
relative loss incurred by any score of the form $\langle a(\pi), b(y) \rangle$. For convenience we
assume (again) that $\pi = 1$ is the optimal permutation:

    $\Delta(y, \pi) := \langle a(1), b(y) \rangle - \langle a(\pi), b(y) \rangle$.    (17)

3.3 Scoring Function

The final step in our problem setting is to define a suitable function $f(x, \pi)$ (where
$x = (q, D)$) which is maximized for the "optimal" permutation. As stated in Section 3.1 we
require a function $g(d, q)$ which will assign a relevance score to every (document, query) pair
independently at test time. The Polya-Littlewood-Hardy inequality tells us how we can obtain a
suitable class of functions $f$, given $g$:

Theorem 2 Let $a, b \in \mathbb{R}^l$ and let $\pi \in \Pi$. Moreover, assume that $a$ is sorted in
decreasing order. Then $\langle a, b(\pi) \rangle$ is maximal for the permutation sorting $b$ in
decreasing order.

Consequently, if we define

    $f(x, \pi) := \sum_{i=1}^{l} c(\pi)_i \, g(d_i, q)$    (18)

for some decreasing sequence $c$, the maximizer of $f$ will be the one which sorts the documents in
decreasing order of relevance, as assigned by $g(d_i, q)$. The expansion (18) also acts as a
guidance when it comes to designing a feature map $\Phi(x, \pi)$, which will map all documents
$d_i$, the query $q$, and the permutation $\pi$ jointly into a feature space. More to the point, it
will need to reflect the decomposition into terms related to individual pairs $(d_i, q)$ only.
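The inner-product form of the scores in Section 3.2, the loss (17), and the sorting statement of Theorem 2 can be illustrated with a short sketch. It assumes 0-based indices, a permutation stored as a list `pi` where `pi[i]` is the position assigned to document `i`, and the natural logarithm (the text leaves the base of the logarithm open). All function names are illustrative.

```python
import math
from itertools import permutations


def dcg_weights(l, n=None):
    """Position weights a_i = 1/log(i + 1) for i <= n (1-based), else 0."""
    n = l if n is None else n
    return [1.0 / math.log(i + 1) if i <= n else 0.0 for i in range(1, l + 1)]


def gains(y):
    """Label-dependent gains b_i = 2^{r_i} - 1."""
    return [2.0 ** r - 1.0 for r in y]


def score(a, b, pi):
    """<a(pi), b> with a(pi)_i = a_{pi(i)}: document i sits at position pi[i]."""
    return sum(a[pi[i]] * b[i] for i in range(len(b)))


def ndcg_at_n(pi, y, n):
    """NDCG@n for labels y sorted in descending order (needs a nonzero ideal DCG)."""
    a, b = dcg_weights(len(y), n), gains(y)
    return score(a, b, pi) / score(a, b, list(range(len(y))))


def ranking_loss(a, b, pi):
    """Loss (17): score of the identity permutation minus score of pi."""
    return score(a, b, list(range(len(b)))) - score(a, b, pi)


def positions_from_scores(g):
    """Permutation that places documents in decreasing order of their scores g."""
    order = sorted(range(len(g)), key=lambda i: -g[i])
    pi = [0] * len(g)
    for position, doc in enumerate(order):
        pi[doc] = position
    return pi


# Theorem 2 on a toy example: for decreasing c, brute force over all
# permutations agrees with simply sorting the scores g.
c = [1.0, 0.5, 0.25]
g = [0.1, 0.9, 0.5]
best = max(permutations(range(3)), key=lambda p: score(c, g, p))
assert list(best) == positions_from_scores(g)
```

Because the scores enter only through `score(a, b, pi)`, swapping in the weights of WTA, MRR, Precision@n, or ERU changes the vectors `a` and `b` but not the rest of the code.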
3.4 General Position Dependent Loss

Before we proceed to solving the ranking problem by defining a suitable feature map, let us
briefly consider the most general case we are able to treat efficiently in our framework.²
Assuming that we are given a ranking $\pi$ of documents $\{d_1, \ldots, d_l\}$, we define the loss as

    $\Delta(\pi, y) := \sum_{i,j=1}^{l} \pi_{ij} C_{ij}(y) + \mathrm{const}$.    (19)

That is, for every position $j$ we have a cost $C_{ij}(y)$ which is incurred by placing document
$i$ at position $j$. Clearly (17) falls into this category: simply choose $C_{ij} = a_i b(y)_j$. For
instance, we might have a web page ranking problem where the first position should contain a
result from a government-related site, the second position should contain a relevant page from a
user-created site, etc. In other words, this setting would apply to cases where specific positions
in the ranked list are endowed with specific meanings.

The problem with this procedure is, however, that estimation and optimization are somewhat
more costly than merely sorting a list of relevance scores: we would want to have a different
scoring function for each position. In other words, the computational cost is dramatically
increased, both in terms of the number of functions needed to compute the scores of an element
and in terms of the optimization required to find a permutation which minimizes the loss
$\Delta(\pi, y)$.

4 Learning Ranking 

4.1 The Featuremap 

We now expand on the ansatz of (18). The linearity of the inner product
$f(x, \pi) = \langle \Phi(x, \pi), w \rangle$, as given by (3), requires that we should be able to
write $\Phi$ as

    $\Phi(x, \pi) = \sum_{i=1}^{l} c(\pi)_i \, \phi(d_i, q)$ where $c \in \mathbb{R}^l$.    (20)

In this case $\langle w, \Phi(D, q, \pi) \rangle = \langle c(\pi), g \rangle$ where
$g_i = \langle w, \phi(d_i, q) \rangle$. The problem of choosing $\phi(d_i, q)$ is ultimately data
dependent, as we do not know in general what data type the documents and queries are composed
of. We will discuss several choices in the context of the experiments in Section 6.

Eq. (20) implies that $f$ is given by

    $f(x, \pi) = \langle \Phi(x, \pi), w \rangle = \sum_{i=1}^{l} c(\pi)_i \langle \phi(d_i, q), w \rangle$.    (21)

Hence we can apply the Polya-Littlewood-Hardy inequality and observe that it is maximized by
sorting the terms $\langle \phi(d_i, q), w \rangle$ in decreasing order just as $c$. Note that this
permutation is easily obtained by applying QuickSort in $O(l \log l)$ time. To obtain only the
$n$ top-ranking terms we can do even better and only need to expend $O(l + n \log n)$ time,
using a QuickSort-style divide and conquer procedure.
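At test time the ranking step therefore reduces to sorting scores. The sketch below uses Python's built-in sort (Timsort rather than QuickSort) and, for the top-$n$ case, `heapq.nlargest`, which runs in $O(l \log n)$ time; it is a convenient stand-in and not the $O(l + n \log n)$ selection procedure mentioned above. The function names are illustrative.

```python
import heapq


def rank_all(scores):
    """Indices of all documents in decreasing order of score, O(l log l)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def rank_top_n(scores, n):
    """Indices of the n highest-scoring documents, best first (heap based)."""
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])


# scores[i] plays the role of <phi(d_i, q), w> for document i.
scores = [0.2, 1.5, -0.3, 0.9]
full_ranking = rank_all(scores)
first_page = rank_top_n(scores, 2)
```

Both functions return document indices, so the same ranking can be reused to look up documents or labels.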

This leaves the issue of choosing $c$. In general we want to choose it such that a) the margin
of (5) is maximized and that b) the average length $\|\Phi(x, \pi)\|^2$ is small for good
generalization and fast convergence of the implementation.

Since $\Phi$ is linear in $c$, we could employ a kernel optimization technique to obtain an
optimal value of $c$, such as those proposed in [Bousquet and Herrmann, 2002, Ong et al., 2003].
While in principle not impossible, this leads rather inevitably to a highly constrained (at worst
semidefinite) optimization problem in terms of $c$ and $w$. Obtaining an efficient algorithm for
this more general problem is a topic of current research. We report experimental results for
different choices of $c$ in Section 6.

² More general cases, such as a quadratic dependency on positions, while possible, will typically
lead to optimization problems which cannot be solved in polynomial time.

The above reasoning is sufficient to apply the optimization problem described in Section 2
to ranking problems. All that changes with respect to the general case is the choice of loss
function $\Delta$ and the feature map $\Phi(x, \pi)$. In order to obtain an efficient optimization
algorithm we need to overcome one last hurdle: we need to find an efficient algorithm for finding
constraint violators in (8).

4.2 Finding Violated Constraints 

Recall (8). In the context of ranking this means that we need to find the permutation $\pi$ which
maximizes

    $\langle \Phi(x, \pi), w \rangle + \Delta(y, \pi)$
        $= \sum_{i=1}^{l} \langle \phi(d_i, q), w \rangle \, c(\pi)_i + \langle a(1), b(y) \rangle - \langle a(\pi), b(y) \rangle$    (22)
        $= \langle c(\pi), g \rangle - \langle a(\pi), b(y) \rangle + \mathrm{const}$.    (23)

where $g_i = \langle \phi(d_i, q), w \rangle$. Note that (23) is a so-called linear assignment
problem which can be solved by the Hungarian Marriage method, i.e. the Kuhn-Munkres algorithm,
in cubic time. Maximizing (23) amounts to solving

    $\operatorname{argmax}_{\pi \in \Pi} \sum_{i=1}^{l} C_{i,\pi(i)}$ where $C_{ij} = c_j g_i - a_j b_i$.    (24)

Note that there is a special case in which the problem (8) can be solved by a simple sorting
operation: whenever $a = c$ the problem reduces to maximizing $\langle a(\pi), g - b(y) \rangle$.
This choice, however, is not always desirable as $a$ may be rather degenerate (i.e. it may contain
many terms with value zero).
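The assignment problem (24) can be sketched for a handful of documents as follows. The code builds the matrix $C_{ij} = c_j g_i - a_j b_i$ and, as a stand-in for the Kuhn-Munkres algorithm, enumerates all permutations, which is only feasible for very small $l$. For realistic sizes a dedicated solver such as SciPy's `scipy.optimize.linear_sum_assignment` could be substituted. The numbers and function names are illustrative assumptions.

```python
from itertools import permutations


def assignment_matrix(c, g, a, b):
    """C[i][j] = c_j * g_i - a_j * b_i: value of placing document i at position j."""
    l = len(g)
    return [[c[j] * g[i] - a[j] * b[i] for j in range(l)] for i in range(l)]


def solve_assignment_brute_force(C):
    """Permutation p (document i -> position p[i]) maximizing sum_i C[i][p[i]]."""
    l = len(C)
    return max(permutations(range(l)),
               key=lambda p: sum(C[i][p[i]] for i in range(l)))


# Toy example with WTA-style a and b: only the top position and the
# most relevant document enter the loss.
c = [1.0, 0.5, 0.25]   # decreasing position weights of the scoring function
a = [1.0, 0.0, 0.0]    # position weights of the performance measure
b = [1.0, 0.0, 0.0]    # label-dependent gains, most relevant document first
g = [0.3, 0.2, 0.1]    # current scores <phi(d_i, q), w>
worst = solve_assignment_brute_force(assignment_matrix(c, g, a, b))
```

In this example the most violated permutation moves the relevant document 0 out of the top position, since that is where the loss term outweighs its score advantage.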

4.3 Solving the Linear Assignment Problem 

It is well known that there exists a convex relaxation of the problem of maximizing
$\sum_i C_{i,\pi(i)}$ into a linear problem which leads to an optimal solution.

    maximize    $\operatorname{tr} \pi^\top C$    (25)
    subject to  $\sum_i \pi_{ij} = 1$ and $\sum_j \pi_{ij} = 1$ and $\pi_{ij} \in \{0, 1\}$

More specifically, the integer linear program can be relaxed by replacing the integrality
constraint $\pi_{ij} \in \{0, 1\}$ by $\pi_{ij} \geq 0$ without changing the solution, since the
constraint matrix is totally unimodular. Consequently the vertices of the feasible polytope are
integral, hence also the solution. The dual is

    minimize $\sum_i (u_i + v_i)$ subject to $u_i + v_j \geq C_{ij}$.    (26)

i 

The solution of linear assignment problems is a well studied subject. The original papers by
Kuhn [1955] and Munkres [1957] implied an algorithm with $O(l^3)$ cost in the number of terms.
Later, Karp [1980] suggested an algorithm with expected quadratic time in the size of the
assignment problem (ignoring log-factors). Finally, Orlin and Lee [1993] propose a linear time
algorithm for large problems. Since in our case the number of pages is fairly small (in the order
of 50 to 200), we used an existing implementation due to Jonker and Volgenant [1987], which
uses modern techniques for computing the shortest path problem arising in (26). See
Section 6.3 for runtime details.

This means that we can check whether a particular set of documents and an associated query
$(D_i, q_i)$ satisfies the inequality constraints of the structured estimation problem (5). Hence
we have the subroutine necessary to make the algorithm of Section 2 work. In particular, this is
the only subroutine we need to replace in SVMStruct [Tsochantaridis et al., 2005].

Algorithm 2 Direct Optimization of Ranking Measures

    Input: Document collections $D_i$, queries $q_i$, ranks $y_i$, sample size $m$, tolerance $\epsilon$
    Initialize $S_i = \emptyset$ for all $i$, and $w = 0$.
    repeat
        for $i = 1$ to $m$ do
            $w = \sum_i \sum_{\pi \in S_i} \alpha_{i\pi} \Phi(x_i, \pi)$
            $\pi^* = \operatorname{argmax}_{\pi \in \Pi} \langle w, \Phi(x_i, \pi) \rangle + \Delta(\pi, y_i)$
            $\xi = \max(0, \max_{\pi \in S_i} \langle w, \Phi(x_i, \pi) \rangle + \Delta(\pi, y_i))$
            if $\langle w, \Phi(x_i, \pi^*) \rangle + \Delta(\pi^*, y_i) > \xi + \epsilon$ then
                Increase constraint set $S_i \leftarrow S_i \cup \{\pi^*\}$
                Optimize (7) using only $\alpha_{i\pi}$ where $\pi \in S_i$.
            end if
        end for
    until $S$ has not changed in this iteration

4.4 In a Nutshell 

Before describing the experiments, let us briefly summarize the overall structure of the
algorithm. In complete analogy to (5) the primal optimization problem can be stated as

    minimize_{w, \xi}  $\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$    (27a)
    subject to  $\langle w, \Phi(x_i, 1) - \Phi(x_i, \pi) \rangle \geq \Delta(y_i, \pi) - \xi_i$
                for all $\xi_i \geq 0$ and $\pi \in \Pi$ and $i \in \{1, \ldots, m\}$.    (27b)

The dual problem of (27) is given by

    minimize_{\alpha}  $\frac{1}{2} \sum_{i,j,\pi,\pi'} \alpha_{i\pi} \alpha_{j\pi'} k((x_i, \pi), (x_j, \pi')) - \sum_{i,\pi} \Delta(y_i, \pi) \alpha_{i\pi}$    (28a)
    subject to  $\sum_{\pi \in \Pi} \alpha_{i\pi} \leq C$ and $\alpha_{i\pi} \geq 0$ for all $i$ and $\pi$.    (28b)

This problem is solved by Algorithm 2. Finally, documents on a test set are ranked by
$g(d, q) = \langle w, \phi(d, q) \rangle$, where $w = \sum_{i,\pi} \alpha_{i\pi} \Phi(x_i, \pi)$.
5 Extensions



5.1 Diversity Constraints 

Imagine the following scenario: when searching for 'Jordan', we will find many relevant webpages 
containing information on this subject. They will cover a large range of topics, such as a 
basketball player (Mike Jordan), a country (the kingdom of Jordan), a river (in the Middle 
East), a TV show (Crossing Jordan), a scientist (Mike Jordan), a city (both in Minnesota and 
in Utah), and many more. Clearly, it is desirable to provide the user with a diverse mix of 
references, rather than exclusively many pages from the same site or domain or topic range. 

One way to achieve this goal is to include an interaction term between the items to be 
ranked. This leads to optimization problems of the form 



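$$\operatorname*{minimize}_{\pi \in \Pi} \quad \sum_{ijkl} \pi_{ij} \pi_{kl} c_{ij,kl} \tag{29}$$

where $c_{ij,kl}$ would encode the interaction between items. This is clearly not desirable, since 
problems of the above type are quadratic assignment problems, which are NP-hard in general and 
hence not known to be solvable in polynomial time. This would render the algorithm 
impractical for swift ranking and retrieval purposes. 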

However, we may take a more pedestrian approach, which will yield equally good perfor- 
mance in practice, without incurring exponential cost. This approach is heavily tailored towards 
ranking scores which only take the top n documents into account. We will require that among 
the top n retrieved documents no more than one of them may come from the same source (e.g. 
topic, domain, subdomain, personal homepage). Nonetheless, we would like to minimize the 
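ranking scores subject to this condition. Formally, we would like to find a matrix 
$\pi \in \{0, 1\}^{l \times n}$ such that 

$$\sum_{i,j} \pi_{ij} a_i b(y)_j \tag{30}$$

is maximized, subject to the constraints 

$$\sum_{i} \pi_{ij} = 1 \quad \text{for all } j \in \{1, \ldots, n\} \tag{31}$$

$$\sum_{i \in B_s} \sum_{j} \pi_{ij} \leq 1 \quad \text{for all } B_s. \tag{32}$$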

Here the disjoint sets $B_s$, which form a partition of $\{1, \ldots, l\}$, correspond to subsets of 
documents (or webpages) which must not be retrieved simultaneously. In the above example, 
for instance, all webpages retrieved from the domain http://www.jordan.govoffice.com would 
be lumped together into one set $B_s$. Another set $B_{s'}$ would cover, e.g., all webpages from 
http://www.cs.berkeley.edu/~jordan/. 

It is not difficult to see that during training, we need to solve an optimization problem of 
the form 

$$\operatorname*{maximize}_{\pi} \quad \operatorname{tr} \pi^\top C \tag{33a}$$

$$\text{subject to} \quad \sum_{i} \pi_{ij} = 1 \ \text{ and } \ \sum_{j} \sum_{i \in B_s} \pi_{ij} \leq 1 \quad \text{where } \pi \in \{0, 1\}^{l \times n}. \tag{33b}$$

We will show below that the constraint matrix of (33) is totally unimodular. This means that 
a linear programming relaxation of the constraint set, i.e. the change from $\pi_{ij} \in \{0, 1\}$ to 
$\pi_{ij} \in [0, 1]$, will leave the solution of the problem unchanged. This can be seen as follows: 

Theorem 3 (Heller and Tompkins [1956]) An integer matrix $A$ with $A_{ij} \in \{0, \pm 1\}$ is 
totally unimodular if no more than two nonzero entries appear in any column, and if its rows can 
be partitioned into two sets such that: 

1. If a column has two entries of the same sign, their rows are in different sets; 

2. If a column has two entries of different signs, their rows are in the same set. 

Corollary 4 The linear programming relaxation of (33) has an integral solution. 

Proof All we need to show is that in (33b) each term $\pi_{ij}$ shows up exactly twice with 
coefficient 1. This is clearly the case since the sets $B_s$ form a partition of $\{1, \ldots, l\}$, which 
accounts for one occurrence, and the assignment constraints account for the other occurrence. 
Placing the assignment constraints in one row set and the $B_s$ constraints in the other satisfies 
the first condition of Theorem 3, hence the theorem applies. ■ 
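
The argument can be checked numerically on a small instance. The sketch below builds the constraint matrix of (33b) for a hypothetical problem with l = 3 documents, n = 2 positions and two blocks, verifies the two conditions of Theorem 3, and confirms total unimodularity by brute force over all square submatrices. All function names and the instance itself are illustrative assumptions; the exhaustive check is only feasible for very small matrices.

```python
from itertools import combinations

def constraint_matrix(l, n, blocks):
    """Rows: one assignment constraint per position j, then one row per block B_s.
    Columns: one per variable pi_ij, ordered by (i, j)."""
    cols = [(i, j) for i in range(l) for j in range(n)]
    block_of = {i: s for s, block in enumerate(blocks) for i in block}
    rows = [[1 if cj == j else 0 for (_, cj) in cols] for j in range(n)]
    rows += [[1 if block_of[ci] == s else 0 for (ci, _) in cols]
             for s in range(len(blocks))]
    return rows

def det(m):
    """Integer determinant by Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, entry in enumerate(m[0]):
        if entry:
            minor = [row[:c] + row[c + 1:] for row in m[1:]]
            total += (-1) ** c * entry * det(minor)
    return total

def heller_tompkins_holds(a, first_set):
    """Check the column conditions of Theorem 3 for the row partition
    (first_set, remaining rows)."""
    for c in range(len(a[0])):
        nz = [r for r in range(len(a)) if a[r][c] != 0]
        if len(nz) > 2:
            return False
        if len(nz) == 2:
            same_sign = a[nz[0]][c] == a[nz[1]][c]
            same_set = (nz[0] in first_set) == (nz[1] in first_set)
            if same_sign == same_set:  # violates condition 1 or condition 2
                return False
    return True

def is_totally_unimodular(a):
    """Brute force: every square submatrix must have determinant in {-1, 0, 1}."""
    n_rows, n_cols = len(a), len(a[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rs in combinations(range(n_rows), k):
            for cs in combinations(range(n_cols), k):
                if det([[a[r][c] for c in cs] for r in rs]) not in (-1, 0, 1):
                    return False
    return True

A = constraint_matrix(3, 2, [[0, 1], [2]])
ht_ok = heller_tompkins_holds(A, first_set={0, 1})  # rows 0, 1: assignment constraints
tu_ok = is_totally_unimodular(A)
```

Each column of `A` contains exactly two ones, one in an assignment row and one in a block row, which is the structure the proof relies on.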

Note that we could extend this further by requiring that weighted combinations over topics 
satisfy $\sum_{ij} \pi_{ij} w_{is} \leq 1$, where now the weights $w_{is}$ may be non-integral and the domains 
where $w_{is}$ is nonzero might overlap. In this case, obviously the optimization problem cannot 
be relaxed easily any more. However, it will still provide useful results when used in 
combination with integer programming codes, such as Bonmin [Bonami et al., 2005]. 

Finally, note that at test stage, it is very easy to take the constraints (33b) into account: 
Simply pick the highest ranking document from each set B s and use the latter to obtain an 
overall ranking. 
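
A minimal sketch of this test-stage rule, assuming that every document carries a relevance score and a group label identifying its set B_s. The function name `diverse_ranking`, the group labels and the scores are hypothetical.

```python
def diverse_ranking(scores, group_of):
    """Keep the best-scoring document of each group, then rank the survivors."""
    best = {}
    for doc, s in enumerate(scores):
        g = group_of[doc]
        if g not in best or s > scores[best[g]]:
            best[g] = doc
    return sorted(best.values(), key=lambda d: -scores[d])

# Hypothetical scores for five documents drawn from three sources.
scores = [0.9, 0.8, 0.7, 0.4, 0.95]
group_of = ["govoffice", "govoffice", "berkeley", "nba", "nba"]
top = diverse_ranking(scores, group_of)
```

The result contains at most one document per source, ordered by score, so the constraint on the sets B_s holds by construction.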

5.2 Ranking Matrix Factorization 

An obvious application of our framework is matrix factorization for collaborative filtering. The 
work of Srebro and Shraibman [2005], Rennie and Srebro [2005], Srebro et al. [2005b] suggests 
that regularized matrix factorizations are a good tool for modeling collaborative filtering appli- 
cations. More to the point, Srebro and coworkers assume that they are given a sparse matrix 
X arising from collaborative filtering, which they would like to factorize. 

More specifically, the entries $X_{ij}$ denote ratings by user $i$ on document/movie/object $j$. The 
matrix $X \in \mathbb{R}^{m \times n}$ is assumed to be sparse, where zero entries correspond to (user, object) 
pairs which have not been ranked yet. The goal is to find a pair of matrices $U, V$ such that 
$UV^\top$ is close to $X$ for all nonzero entries. Or more specifically, such that the entries 
$[UV^\top]_{ij}$ can be used to recommend additional objects. 

However, this may not be a desirable approach, since it is, for instance, completely irrelevant 
how accurate our ratings are for undesirable objects (small $X_{ij}$), as long as we are able to 
capture the user's preferences for desirable objects (large $X_{ij}$) accurately. In other words, we 
want to model the user's likes well, rather than his dislikes. In this sense, any indiscriminate 
minimization, e.g. of a mean squared error or a large margin error for $X_{ij}$, is inappropriate. 

Instead, we propose to use a ranking score such as those proposed in Section 3.2 to evaluate 
an entire row of X at a time. That is, we want to ensure that the row $X_{i\cdot}$ is well reflected as a 
whole in the estimates for all objects j for a fixed user i. This means that we should be minimizing 



where $\Delta$ is defined as in (17) and it is understood that it is evaluated over the nonzero terms 
of $X_{i\cdot}$ only. This is a highly nonconvex optimization problem. However, we can, again, find a 
convex upper bound by the methods described in (5), yielding a function $R_{\mathrm{emp}}[U, V, X]$. The 
technical details are straightforward and therefore omitted. 

Note that by linearity this upper bound is convex in U and V respectively, whenever the 
other argument remains fixed. Moreover, note that R[U, V, X] decomposes into m independent 
problems in terms of the users $U_{i\cdot}$, whenever V is fixed, whereas no such decomposition holds 
in terms of V. 

In order to deal with overfitting, regularization of the matrices U and V is recommended. 
The trace norm $\|U\|_F^2 + \|V\|_F^2$ can be shown to have desirable properties in terms of 
generalization [Srebro et al., 2005a]. This suggests an iterative procedure for collaborative filtering: 




(34) 
• For fixed V solve m independent optimization problems in terms of $U_{i\cdot}$, using the Frobenius 
norm regularization on U. 

• For fixed U solve one large-scale convex optimization problem in terms of V. 

Since the number of users is typically considerably higher than the number of objects, it is 
possible to deal with the optimization problem in V efficiently. Details are subject to future 
research. 

6 Experiments 

We address a number of questions: a) is learning needed for good ranking performance, b) how 
does DORM (our algorithm) perform with respect to other algorithms on a number of datasets 
of different size and truncation level of the performance criteria, c) how fast is DORM when 
compared to similar large margin ranking algorithms, and d) how important is the choice of c 
for good performance? 

6.1 Datasets and Experimental Protocol 

UCI We choose PageBlock, PenDigits, OptDigits, and Covertype from the UCI repository 
mainly to increase the number of different datasets on which we may compare DORM to 
other existing approaches. Since they are not primarily ranking datasets, we will discuss the 
outcomes only briefly for illustrative purposes. For PageBlock, PenDigits and OptDigits 
we sample 50 queries and 100 documents for each query. For Covertype we sample 500 
queries and 100 documents. 

Web Search Our web search dataset (courtesy of Chris Burges at Microsoft Research) consists 
of 1000 queries each for training, validation and testing. They are provided and selected 
from a larger pool of training data used for a search engine. Figure 1 shows a histogram 
of the number of documents per query (the median is approximately 50). Documents are 
ranked according to five levels of relevance (1:Bad, 2:Fair, 3:Good, 4:Excellent, 5:Perfect). 
Unlabelled documents are treated as Bad. The ratio between the five categories (1 to 5) 
is approximately 75:17:15:2:1. The length of the feature vectors is 367 (i.e. we are using 
BM25). We evaluate our algorithm with respect to three goals: performance in terms of 
NDCG@n, MRR and WTA performance. 

EachMovie This collaborative filtering dataset consists of 2811983 ratings by 72916 users on 
1628 movies. To prove the point that our improvement is not due to an improved choice 
of a kernel but rather in the improved choice of a loss function, we follow the experimental 
setup, choice of kernels, and pre-processing of [Basilico and Hofmann, 2004] and compare 
performance using ERU. We also use the experimental setup of [Yu et al., 2006] and 
compare performance using NDCG and NDCG@10. In both cases we are able to improve 
the results considerably. The datasets for both experiments are as provided by Thomas 
Hofmann at Google Research, and Shipeng Yu at Siemens Research respectively. 

Protocol Since WebSearch provided a validation set, we used the latter for model selection. 
Otherwise, 10-fold cross validation was used to adjust the regularization constant C. We 
used linear kernels throughout, except for the EachMovie datasets, where we followed the 
protocols of [Basilico and Hofmann, 2004] and [Yu et al., 2006]. This was done to show 
that the performance improvement we observe is due to our choice of a better loss function 
rather than the function class. Note that NDCG, MRR were rescaled from [0, 1] to [0, 100] 
for better visualization. 
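
As an illustration of the rescaled scores, the sketch below computes NDCG@n on the [0, 100] scale. It assumes a commonly used form of the measure, with gain 2^r - 1 and a base-2 logarithmic position discount; the definition in Section 3.2 of the paper is authoritative and may differ in detail. The function names are hypothetical.

```python
import math

def dcg_at_n(labels, n):
    """Discounted cumulative gain of the first n labels (assumed gain 2^r - 1,
    assumed discount 1 / log2(position + 1) with positions starting at 1)."""
    return sum((2 ** r - 1) / math.log2(pos + 2)
               for pos, r in enumerate(labels[:n]))

def ndcg_at_n(labels_in_ranked_order, n):
    """NDCG@n rescaled from [0, 1] to [0, 100]."""
    ideal = dcg_at_n(sorted(labels_in_ranked_order, reverse=True), n)
    if ideal == 0:
        return 0.0
    return 100.0 * dcg_at_n(labels_in_ranked_order, n) / ideal

perfect = ndcg_at_n([3, 2, 1, 0], 10)   # labels already in the ideal order
reversed_order = ndcg_at_n([0, 1, 2, 3], 10)
```

The input is the list of relevance labels in the order produced by the ranker, so the ideal ordering is obtained simply by sorting the same labels in decreasing order.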



Dataset       ROC Area    SVM         Prec@10     DORM 
PageBlocks    35.9 ± 7    46.5 ± 7    44.0 ± 7    63.7 ± 6 
PenDigits     26.2 ± 8    41.5 ± 4    15.6 ± 3    85.2 ± 3 
OptDigits     26.0 ± 9    26.1 ± 3    26.2 ± 3    76.1 ± 6 
Covertypes    47.0 ± 2    48.5 ± 2    42.1 ± 1    58.8 ± 2 

Table 1: Performance on UCI data. Bold indicates a high significance of p < 0.0001 by paired 
t-test. 
Figure 1: Number of documents per query. 



6.2 UCI Datasets 

Since the UCI data does not come in the form of multiple queries, we permute the datasets and 
randomly subsample documents for each query. 

We compare DORM against SVM classification, Precision@n and ROCArea as reported 
by [Joachims, 2005] and use NDCG@10 as a performance measure. Table 1 shows that our 
algorithm significantly outperforms the other methods. In DORM, c was chosen to be Cj = + 1). 
The experimental results show how hard it is to optimize NDCG: traditional methods perform 
poorly when NDCG is used as the performance metric, while direct optimization of the correct 
criterion works very well. 



6.3 NDCG for Web Search 

We compare DORM to a range of kernel methods for ranking: multiclass SVM classifiers, SVM 
for ordinal regression (RSVM) [Joachims, 2002, Herbrich et al., 2000], SVM for information 
retrieval (SVM-IR-QP) [Cao et al., 2006], Precision@n and ROCArea [Joachims, 2005]. 
We use NDCG to assess the performance of the ranking algorithms. BM25 [Robertson et al., 
1994, Robertson and Hull, 2000] is used as a baseline (it also constitutes the feature vector for 
the other algorithms). 

In the first experiment, we use the full training set (with parameter selection on the validation 
set) to train our model and report the performance on the designated test set. We report the 
average prediction results for NDCG with various truncation levels ranging from 1 to 10 in 
Figure 2. Note that DORM consistently outperforms other methods for ranking by 2-3%. We 
chose $c_i = (i + 1)^{-1/2}$. 

Figure 2: NDCG@n scores on the web search dataset at different truncation levels n. 

[Plot legend: BM25, SVM, RSVM, RSVM-IR-QP, prec@10, ROCArea, DORM. Horizontal axis: number of queries.] 

Figure 3: NDCG@10 scores on the web search dataset for different sample sizes. Note that the 
performance of DORM on 100 observations is higher than that of any other competing method on 
1000 observations. 

Figure 4: Maximum difference in NDCG@10 for different choices of c with respect to $1/\sqrt{m}$. 

In a second experiment (same choice of c), we investigate the effects of increasing training 
set size using the first m = 100, 300, 500, 700 or all 1000 queries for training. Using the same 
experimental protocol as above we report the average prediction results for NDCG@10 in Figure 
3. Our results confirm that the same gain for NDCG@10 can be achieved if we increase the 
training set size. Note that for many methods, doubling the sample size increases NDCG by 
less than 1% (Figure 3). DORM achieves the same gain without the need to double m. In fact, 
DORM using 100 queries for training beats all other methods using 1000 queries! 

Since expert-judged datasets can be very expensive, DORM is more cost-effective. This 
confirms that learning to rank is beneficial, as combining many features in a ranking 
algorithm can result in better performance than using one or several features individually, 
such as BM25. 

The choice of c critically determines the feature map. At the moment, we have little 
theoretical guidance with regard to this matter, hence we investigated the effect of different 
choices of c experimentally. Clearly c needs to be a monotonically decreasing function. We chose 
$c_i = (i + 1)^{-d}$ for $d \in \{\ldots, 1, 2, 3\}$, as well as $c_i = 1/\log(i + 2)$ and $c_i = 1/\log\log(i + 2)$. 
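
The decay schemes named here can be generated as follows. This is a minimal sketch that assumes positions are indexed i = 1, ..., l; the function names are hypothetical, and the exponents used in the example (1/2, 1, 2) are illustrative picks rather than the full set explored in the experiments.

```python
import math

def poly_decay(l, d):
    """c_i = (i + 1)^(-d) for positions i = 1..l (assumed indexing)."""
    return [(i + 1) ** (-d) for i in range(1, l + 1)]

def log_decay(l):
    """c_i = 1 / log(i + 2) for positions i = 1..l."""
    return [1.0 / math.log(i + 2) for i in range(1, l + 1)]

def loglog_decay(l):
    """c_i = 1 / log(log(i + 2)); requires i >= 1 so that log(i + 2) > 1."""
    return [1.0 / math.log(math.log(i + 2)) for i in range(1, l + 1)]

def is_decreasing(c):
    """True if the weights are strictly monotonically decreasing."""
    return all(a > b for a, b in zip(c, c[1:]))

schemes = {
    "poly d=0.5": poly_decay(10, 0.5),
    "poly d=1": poly_decay(10, 1),
    "poly d=2": poly_decay(10, 2),
    "log": log_decay(10),
    "loglog": loglog_decay(10),
}
all_decreasing = all(is_decreasing(c) for c in schemes.values())
```

All of these schemes put the largest weight on the first position, which is the property the text requires of c; they differ only in how quickly the weight of lower positions vanishes.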

We found experimentally that the differences between the various schemes are not as 
dramatic as the improvement obtained by using DORM instead of other algorithms. To summarize 
the results we show the difference in performance in Figure 4 for NDCG@10. Note that the 
difference in NDCG accuracy resulting from different choices of c decreases as the sample 
size increases. The rate of convergence is suspected to be $1/\sqrt{m}$. A possible interpretation is 
that the choice of c can be considered prior knowledge. Thus with increasing sample size, we 
will need to rely less on this prior knowledge and a reasonable choice of c will suffice. 
6.4 MRR and WTA for Web Search 



MRR DORM is effective not only for NDCG but also for other performance measures. We 
compare it using the Mean Reciprocal Rank (MRR) and Winner Takes All (WTA) scores on 
the same dataset. For comparison we use Precision@n (with n = 3, since this yielded the best 
results experimentally), DORM minimizing NDCG (which in this case is the incorrect criterion), 
and the previous methods. 

As before, we use the validation set to adjust the regularization parameter C. We picked 
Cj = + 1) (the influence of the particular choice of c was rather minor, as demonstrated in 
Figure 4). The average results for varying sample size (from 100 to 1000) are reported in Figure 
5. 

It can be seen from Figure 5 that DORM for MRR outperforms all other models including 
DORM for NDCG. This is not surprising, since DORM for MRR optimizes MRR directly while 
other methods do not. DORM for MRR beats other methods by 1% to 2%, which is quite 
significant given that doubling the dataset size yields a gain of only around 1%. The fact that 
the gains in optimizing MRR are smaller than those in optimizing NDCG is probably due to the 
fact that MRR is less structured than NDCG. 

WTA This is the least structured of all scores, as it only takes the top ranking document 
retrieved into account. This means that for a ranking dataset where 5 different degrees of 
relevance are available, only the top scoring ones are chosen. This transformation discards a 
great deal of information in the labels (i.e. the gradations among the lower-scoring documents), 
which leads to the suspicion that minimizing a related cost function taking all levels into ac- 
count should perform better. It turns out, experimentally, that indeed a direct optimization 
of the WTA loss function leads to bad performance. In order to amend this issue, we decided 
to minimize a modified NDCG score instead of the straight WTA score. This improved the 
performance significantly. 

The truncation level in the NDCG@n score should be closer to 1 than to 10. We devise 
a heuristic to find the truncation level: for queries that have more than 2 levels of preference 
among the top 3 items, the truncation level is 3; for queries that have only one level of preference 
among the top 3 items, the truncation level is placed after the position where the next level of 
preference appears in the ranked list, since for queries with a large number of top-ranking 
documents we want to include at least one lower-ranking document in the list. We call the method 
mWTA (modified DORM for WTA). 

Experiments with different decay terms for c indicated that $c_i = 1/\sqrt{i + 1}$ yields the best 
results. We compare the new method with various methods using model selection on the validation 
set and report the total number of correct predictions in Figure 6. RSVM and RSVM-IR-QP 
perform poorly on this task and we omit their results due to space constraints. While direct 
optimization of WTA is unsatisfactory, mWTA outperforms other methods significantly. 

6.5 Runtime Performance 

One might suspect that our formulation is potentially slow, as the Hungarian marriage algorithm 
takes cubic time $O(l^3)$. However, each such optimization problem is relatively small (on average 
on the order of 50 documents per query), which means that the overall computational time is 
well controlled. 

For practical results we carried out experiments to measure the time for training plus cross 
validation. We used a modified version of SVMStruct [Tsochantaridis et al., 2005]. The 
algorithms are all written in C and the code was run on a Pentium 4 3.2GHz workstation with 1GB 
RAM, running Linux and using GCC 3.3.5. As can be seen in Figure 7, DORM outruns most 
other methods, except for multiclass SVM, and BM25 (which does not require training). Note 
that ordinal regression algorithms are significantly slower than DORM, as they need to deal 
with a huge number of simple inequalities rather than a smaller number of more meaningful 
ones. 

Figure 5: MRR scores on the web search dataset for different sample sizes. We compare eight 
methods (including DORM), using a = + 1). 

Figure 6: WTA scores ("I'm feeling lucky") on the web search dataset for different sample 
sizes. DORM (mWTA) minimizing an adaptive version of NDCG outperforms straight WTA 
minimization, as it makes better use of the label information. 

Figure 7: Runtime of DORM (NDCG and MRR optimization) vs. other algorithms, ignoring 
file IO. 

Table 2 compares the number of support vectors (the more the slower), the number 
of column generation iterations (the more the slower), and the percentage of time spent in the 
QP solver. DORM is faster than other algorithms since it has a sparser solution. In terms of the 
number of iterations and the percentage of time in the QP solver, DORM is a well balanced 
solution between Precision@n and ROCArea. Results are similar when optimizing MRR (this is 
reported in the bottom of Table 2). Note that since all models use linear functions, prediction 
time is less than 0.5s for 1000 queries. 

6.6 EachMovie and Collaborative Filtering 

ERU Past published results on collaborative filtering use Expected Rank Utility (ERU), 
NDCG and NDCG@10 as reference scores. In order to show that the improvement in per- 
formance is truly due to a better loss function rather than a different kernel we use the same 
kernels and experimental protocol as proposed by [Basilico and Hofmann, 2004] using the same 
parameter combinations in the context of ERU. Table 3 shows the merit of DORM: it out- 
performs JRank [Basilico and Hofmann, 2004] and PRank [Crammer and Singer, 2002]. In 
experiment 1 we used user features in combination with item correlations. In experiment 2 we 
used item features in combination with user ratings. In both cases, results are averaged over 
100 trials with 100 training users, 2000 input users and 800 training items. 

Having used a kernel which is optimal for JRank we expect that optimizing the kernel further 
would lead to better results, as there is no reason to assume that the model class optimal for 
JRank would be the best choice for DORM, too. 

NDCG In a second experiment, we mimicked the experimental protocol of [Yu et al., 2006] on 
EachMovie. Here, we treat each movie as a document and each user as a query. After filtering 
out all the unpopular documents and queries (as in [Yu et al., 2006]) we have 1075 documents 
and 100 users. 

For each user, we randomly select 10, 20 and 50 labeled items for training and perform 
prediction on the rest. The process is repeated 10 times independently. The methods for 
comparison are the standard Gaussian process regression (GPR) [Rasmussen and Williams, 
2006], Gaussian Process ordinal regression (GPOR) [Chu and Ghahramani, 2005], and their 
collaborative extensions (CGPR, CGPOR) [Yu et al., 2006], MMMF [Rennie and Srebro, 2005] 
and DORM (for NDCG). The figures for the first 5 methods are extracted from [Yu et al., 2006] 
and scaled by 100 to fit our convention of showing NDCG results. We perform an unpaired 
t-test for significance (see Table 4). 

NDCG@10 optimization 
Method          # SVs    # Iter    % in QP 
Precision@10    1037     44        89.02 
ROCArea         997      12        19.64 
DORM (NDCG)     561      22        28.76 

MRR optimization 
Method          # SVs    # Iter    % in QP 
Precision@3     1000     11        46.80 
ROCArea         997      12        19.64 
DORM (NDCG)     520      23        61.66 
DORM (MRR)      550      17        1.76 

Table 2: Number of support vectors, iterations in column generation, and time spent in the 
Quadratic Programming loop for various SVM style optimization algorithms. Top: optimization 
for NDCG@10, using q = +. Bottom: optimization for MRR and NDCG, using 
Cj = + 1). Corresponding methods have different numbers due to the different choices of 
truncation level (Precision@n) and c (DORM NDCG). 

Experiment    PRank    JRank    DORM 
1             70.8     75.3     76.5 ± 0.43 
2             73.4     76.2     76.7 ± 0.32 

Table 3: Expected rank utility scores for three methods. Results averaged over 100 trials with 
100 training users, 2000 input users and 800 training items. 

N     Method    NDCG            NDCG@10 
10    GPR       83.41 ± 0.22    45.58 ± 1.51 
      CGPR      86.39 ± 0.24    57.34 ± 1.44 
      GPOR      80.59 ± 0.03    36.92 ± 0.25 
      CGPOR     80.83 ± 0.11    37.89 ± 1.05 
      MMMF      84.34 ± 0.48    47.46 ± 3.42 
      DORM      87.17 ± 0.24    61.75 ± 1.83 
                p < 0.0001      p < 0.0001 
20    GPR       84.12 ± 0.15    48.49 ± 0.66 
      CGPR      86.98 ± 0.16    59.89 ± 1.18 
      GPOR      80.48 ± 0.05    36.78 ± 0.30 
      CGPOR     80.78 ± 0.13    37.81 ± 0.56 
      MMMF      84.85 ± 0.28    47.86 ± 1.39 
      DORM      87.63 ± 0.37    62.82 ± 1.9 
                p < 0.0001      p = 0.0006 
50    GPR       85.15 ± 0.23    53.75 ± 0.89 
      CGPR      87.82 ± 0.21    63.41 ± 1.14 
      GPOR      80.10 ± 0.04    36.63 ± 0.24 
      CGPOR     80.45 ± 0.06    37.74 ± 0.41 
      MMMF      86.13 ± 0.38    54.78 ± 2.11 
      DORM      87.84 ± 0.32    65.05 ± 1.27 
                p = 0.8706      p = 0.006 

Table 4: NDCG optimization on EachMovie dataset. Comparison between 6 methods using 
unpaired t-test with values of p shown (best score vs. second best score). 
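
The significance test can be reproduced from the summary statistics alone. The sketch below assumes that the ± values in Table 4 are sample standard deviations over the 10 independent repetitions and that a pooled-variance (Student) t-test with 18 degrees of freedom was used; both are assumptions, since the text only says "unpaired t-test". The p-value is obtained by numerically integrating the t density with Simpson's rule, so only the standard library is needed. All function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson integration of the density on [0, |t|].
    steps must be even."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def unpaired_t_test(mean_a, sd_a, mean_b, sd_b, n):
    """Pooled-variance t-test for two groups of equal size n, from summary statistics."""
    se = math.sqrt((sd_a ** 2 + sd_b ** 2) / n)
    t = (mean_a - mean_b) / se
    return t, two_sided_p(t, 2 * n - 2)

# DORM vs. CGPR (best vs. second best), full NDCG, values taken from Table 4.
t50, p50 = unpaired_t_test(87.84, 0.32, 87.82, 0.21, 10)   # N = 50
t10, p10 = unpaired_t_test(87.17, 0.24, 86.39, 0.24, 10)   # N = 10
```

Under these assumptions the N = 50 comparison comes out close to the reported p = 0.8706, and the N = 10 comparison falls well below 0.0001, which suggests that the assumed test matches the one used for the table.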

Note that there is no cross validation or model selection involved in [Yu et al., 2006]. Thus 
to be fair, we fix the following parameters: $c_i = (i + 1)^{-0.25}$ (performing slightly better on this 
dataset), C = 0.01 and n = 10 (the truncation level of NDCG). The results show that DORM 
performs very well for predicting the ranking for new items, especially when the number of 
labeled items is small. 

7 Summary and Discussion 

In this paper we proposed a general scheme to deal with a large range of criteria commonly 
used in the context of web page ranking and collaborative filtering. Unlike previous work, 
which mainly focuses on pairwise comparisons, we aim to minimize the multivariate performance 
measures (or rather a convex upper bound on them) directly. This yields both computational 
savings, leading to a faster algorithm, and practical ones, leading to better performance. In a 
way, our work follows the mantra of [Vapnik, 1982] of estimating directly the desired quantities 
rather than optimizing a surrogate function. There are clear extensions of the current work: 

• The key point of our paper was to construct a well-designed loss function for optimization. 
In this form it is completely generic and can be used as a drop-in replacement in many 
settings. We completely ignored language models [Ponte and Croft, 1998] to parse the 
queries in any sophisticated fashion. 

• Although the content of the paper is directed towards ranking, the method can be gener- 
alized for optimizing many other complicated multivariate loss functions. 

• We could use our method directly for information retrieval tasks or authorship identifica- 
tion queries. In the latter case, the query qi would consist of a collection of documents 
written by one author. 

• We may add personalization to queries. This is no major problem, as we can simply add 
some personal data $u_i$ to $\phi(q_i, d_i, u_i)$ and obtain a personalized ranking. 

• Online algorithms along the lines of [Shalev-Shwartz and Singer, 2006] can easily be ac- 
commodated to deal with massive datasets. 

• The present algorithm can be extended to learn matching problems on graphs. This is 
achieved by extending the linear assignment problem to a quadratic one. The price one 
needs to pay in this case is that the Hungarian Marriage algorithm is no longer feasible, 
as the optimization problem itself is NP hard. 

Note that the choice of a Hilbert space for the scoring functions is done for reasons of conve- 
nience. If the applications demand Neural Networks or similar (harder to deal with) function 
classes instead of kernels, we can still apply the large margin formulation. That said, we find 
that the kernel approach is well suited to the problem. 